Python is written in C (actually the default implementation is called CPython)
# -< comment
print('Hello World!')
Hello World!
'Hello World!'
'Hello World!'
Types of Python objects:
Integers can be negative or positive numbers. Floats are real numbers (a superset of integers, including numbers with decimals)
Check type of a variable with the type() function:
# get the type
type(1.11)
float
#convert int and string to float:
float(55)
55.0
float('1.15')
1.15
#convert integer or float to string:
str(33)
'33'
#convert to string
str(1.85)
'1.85'
#Convert int to bool:
bool(1)
True
#Convert bool to float:
float(True)
1.0
# Sum, multiplication, division of integers
45+55+90
190
#multiply
3*5
15
#divide
99//11
9
# Division of floats
99/11
9.0
# Store value into variable
x=85
# Pring value of variable
x
85
#print
print(x)
85
# String concatenation
y = 'Michael Jordan is ' + 'the best'
# String slises
y[4:7]
'ael'
# Get every second element in the string
y[::2]
'McalJra stebs'
# Get every second element in range(0 to 4)
y[0:4:2]
'Mc'
# New line escape sequence
print('Michael Jordan is \n the best')
Michael Jordan is the best
# Tab excape sequence
print('Michael Jordan is \t the best')
Michael Jordan is the best
# Backslash as string
print('Michael Jordan is \\ the best')
Michael Jordan is \ the best
# Replace all letters with uppercase with upper() method
y.upper()
'MICHAEL JORDAN IS THE BEST'
# Replace old substring with new string with replace() method
y.replace('Michael', 'Goshko')
'Goshko Jordan is the best'
# Find substring in the string with find() method. Shows the index of the first element of the substring. Negative output if the substring is not present.
y.find('is')
15
#find in list
y.find('bla bla')
-1
#concatenate strings
y = 'Michael Jordan is ' + 'the best'
y
'Michael Jordan is the best'
# python is a calculator
n = 3*7 + 2
n
23
#f strings
f"M.Jordan's number is {n}"
"M.Jordan's number is 23"
# Condition Equal, Inequality, Greater, Lower than
5==6
False
# boolean comparison
5!=6
True
# boolean comparison
5<6
True
# boolean comparison
5>6
False
# Compare strings
'ABCD' == "BDCA"
False
# boolean comparison
'ABCD' != "BDCA"
True
#creating a variable
age = 25
# Branching in Python
if age > 18:
print("you can enter" )
else:
print("you can't" )
you can enter
# elif statement
if age > 18:
print("you can enter" )
elif age > 21:
print("some statement here")
else:
print("you can't enter" )
you can enter
# Logical operators AND, OR, NOT
if age >18 and age<21:
print("You can enter, but you can't drink alcohol")
#simple if or
if age<20 or age>30:
print("You are not in your 20s")
else:
print("You are in your 20s")
You are in your 20s
# if not
if not age==18:
print('you are not 18 years old')
you are not 18 years old
#range and for loop
for i in range(3):
print(i)
0 1 2
# while loop
i=0
while(i!=5):
print(i)
i+=1
0 1 2 3 4
# list comprehensions
[i for i in range(3)]
[0, 1, 2]
#functions
def addition(a):
b = a + 1
print(a, "add 1 to a:", b)
return(b)
# call the function
addition(6)
6 add 1 to a: 7
7
# Return values
def MJ():
print('Michael Jackson')
#call without argument
MJ()
Michael Jackson
#again no argument
def MJ1():
print('Michael Jackson')
return(None)
MJ1()
Michael Jackson
# Create a class
class Woman:
def __init__(self, age, hair, breasts ):
self.age = age
self.hair = hair
self.breasts = breasts
#create class
Wife = Woman(43, 'dark', 'small')
#create class
Lover = Woman(19, 'blond', 'big')
#check classes
Wife.age > Lover.age
True
#check classes
if Wife.breasts != 'big':
print('She has a nice personality!')
elif Wife.breasts == 'big':
print('Niiiiiiiiiiiiiice!')
She has a nice personality!
#create def for class
def breast_size(Wife):
if Wife.breasts != 'big':
print('She has a nice personality!')
elif Wife.breasts == 'big':
print('Niiiiiiiiiiiiiice! GJ Bro')
#return an expression based on value
breast_size(Lover)
Niiiiiiiiiiiiiice! GJ Bro
# Create a class
class Woman_br:
def __init__(self, age, hair, breasts ):
#function inside of class function
def nice_or_nah(self):
if self.breasts != 'big':
grade = 'She has a nice personality!'
if self.breasts == 'big':
grade = 'Niiiiiiiiiiiiiice! GJ Bro'
return grade
self.age = age
self.hair = hair
self.breasts = breasts
self.grade = nice_or_nah(self)
W = Woman_br(23,'amber', 'small')
W.grade
'She has a nice personality!'
Numpy solves one essential problem very fast: array processing
It is also very fast with problems that are naturally vectorized, like matrix multiplication and linear algebra routines, vectors generation and application of fixed transformation over entire array.
import numpy as np
x = np.random.uniform(0, 1, size=1000000)
x
array([0.17876904, 0.43982775, 0.42300266, ..., 0.30784663, 0.94431355,
0.5592932 ])
x.mean()
0.5000317244874024
x
array([0.17876904, 0.43982775, 0.42300266, ..., 0.30784663, 0.94431355,
0.5592932 ])
a = np.zeros(3)
a
array([0., 0., 0.])
type(a)
numpy.ndarray
a.shape
(3,)
z = np.zeros(4)
z
array([0., 0., 0., 0.])
z.shape = (2, 2)
z
array([[0., 0.],
[0., 0.]])
z = np.empty(3)
z
array([0., 0., 0.])
z = np.linspace(2, 4, 5)
z
array([2. , 2.5, 3. , 3.5, 4. ])
z = np.identity(3)
z
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
list = [10,20]
z = np.array(list)
z
array([10, 20])
z = np.linspace(1, 2, 5)
z
array([1. , 1.25, 1.5 , 1.75, 2. ])
print(z[0])
1.0
print(z[1])
1.25
print(z[-1])
2.0
z = np.array([[1, 2], [3, 4]])
z
array([[1, 2],
[3, 4]])
z[0, 0]
1
z[1, 0]
3
z[0, :]
array([1, 2])
z[:, 1]
array([2, 4])
a = np.array((4, 3, 2, 1))
a
array([4, 3, 2, 1])
a.mean()
2.5
a.sum()
10
a.sort()
a
array([1, 2, 3, 4])
a.max()
4
# return index of max element
a.argmax()
3
# cumulative sum of a elements
a.cumsum()
array([ 1, 3, 6, 10], dtype=int32)
a.cumprod()
array([ 1, 2, 6, 24], dtype=int32)
a.var()
1.25
a.std()
1.118033988749895
a.T
array([1, 2, 3, 4])
#use functions above from the np namespace
np.mean(a)
2.5
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
a + b
array([ 6, 8, 10, 12])
a * b
array([ 5, 12, 21, 32])
a+10
array([11, 12, 13, 14])
a * 10
array([10, 20, 30, 40])
A = np.ones((2, 2))
B = np.ones((2, 2))
A
array([[1., 1.],
[1., 1.]])
B
array([[1., 1.],
[1., 1.]])
A+B
array([[2., 2.],
[2., 2.]])
A + 10
array([[11., 11.],
[11., 11.]])
# Element-wise product, not matrix product
A * B
array([[1., 1.],
[1., 1.]])
# Matrix multiplication
A @ B
array([[2., 2.],
[2., 2.]])
z = np.array([1, 2, 3])
z
array([1, 2, 3])
np.sin(z)
array([0.84147098, 0.90929743, 0.14112001])
x = np.random.randn(4)
x
array([ 0.77530913, 1.41924919, 0.93456831, -0.54786971])
np.where(x > 0, 1, 0) # Insert 1 if x > 0 true, otherwise 0
array([1, 1, 1, 0])
x
array([ 0.77530913, 1.41924919, 0.93456831, -0.54786971])
# Vectorizing a user-defined function
def f2(x):
if x > 0:
return 1
else:
return 0
f = np.vectorize(f2)
f(x)
array([1, 1, 1, 0])
# Comparisons are done element-wise
z = np.array([2, 3])
y = np.array([2, 3])
z == y
array([ True, True])
z != y
array([False, False])
z > 3
array([False, False])
b = z > 3
b
array([False, False])
Pandas is a module for fast and efficient data analysis tools in Python.
import pandas as pd
{'col1': [1, 2], 'col2': [3, 4]}
{'col1': [1, 2], 'col2': [3, 4]}
pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
| col1 | col2 | |
|---|---|---|
| 0 | 1 | 3 |
| 1 | 2 | 4 |
df = pd.read_excel('co1.xlsx')
df_path = pd.read_excel(r'C:\Users\Admin\Downloads\Revenue per server data.xlsx', sheet_name = 'Raw monthly')
import yfinance as yf
stocks = ['AMZN', 'AAPL', 'MSFT', 'GOOG','BTC-USD']
data = yf.download(stocks,
start="2021-01-01",
end="2021-07-01",
interval ='1d').dropna()['Adj Close']
[*********************100%***********************] 5 of 5 completed
data.to_excel('Financial_data.xlsx')
data
| AAPL | AMZN | BTC-USD | GOOG | MSFT | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-12-31 | 132.267349 | 3256.929932 | 29001.720703 | 1751.880005 | 221.397675 |
| 2021-01-04 | 128.997803 | 3186.629883 | 31971.914062 | 1728.239990 | 216.689423 |
| 2021-01-05 | 130.592697 | 3218.510010 | 33992.429688 | 1740.920044 | 216.898438 |
| 2021-01-06 | 126.196747 | 3138.379883 | 36824.363281 | 1735.290039 | 211.274414 |
| 2021-01-07 | 130.502991 | 3162.159912 | 39371.042969 | 1787.250000 | 217.286652 |
| ... | ... | ... | ... | ... | ... |
| 2021-06-24 | 133.410004 | 3449.080078 | 34662.437500 | 2545.639893 | 266.690002 |
| 2021-06-25 | 133.110001 | 3401.459961 | 31637.779297 | 2539.899902 | 265.019989 |
| 2021-06-28 | 134.779999 | 3443.889893 | 34434.335938 | 2536.389893 | 268.720001 |
| 2021-06-29 | 136.330002 | 3448.139893 | 35867.777344 | 2520.370117 | 271.399994 |
| 2021-06-30 | 136.960007 | 3440.159912 | 35040.835938 | 2506.320068 | 270.899994 |
125 rows × 5 columns
data.index
DatetimeIndex(['2020-12-31', '2021-01-04', '2021-01-05', '2021-01-06',
'2021-01-07', '2021-01-08', '2021-01-11', '2021-01-12',
'2021-01-13', '2021-01-14',
...
'2021-06-17', '2021-06-18', '2021-06-21', '2021-06-22',
'2021-06-23', '2021-06-24', '2021-06-25', '2021-06-28',
'2021-06-29', '2021-06-30'],
dtype='datetime64[ns]', name='Date', length=125, freq=None)
data.columns
Index(['AAPL', 'AMZN', 'BTC-USD', 'GOOG', 'MSFT'], dtype='object')
data.head()
| AAPL | AMZN | BTC-USD | GOOG | MSFT | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-12-31 | 132.267349 | 3256.929932 | 29001.720703 | 1751.880005 | 221.397675 |
| 2021-01-04 | 128.997803 | 3186.629883 | 31971.914062 | 1728.239990 | 216.689423 |
| 2021-01-05 | 130.592697 | 3218.510010 | 33992.429688 | 1740.920044 | 216.898438 |
| 2021-01-06 | 126.196747 | 3138.379883 | 36824.363281 | 1735.290039 | 211.274414 |
| 2021-01-07 | 130.502991 | 3162.159912 | 39371.042969 | 1787.250000 | 217.286652 |
data.tail()
| AAPL | AMZN | BTC-USD | GOOG | MSFT | |
|---|---|---|---|---|---|
| Date | |||||
| 2021-06-24 | 133.410004 | 3449.080078 | 34662.437500 | 2545.639893 | 266.690002 |
| 2021-06-25 | 133.110001 | 3401.459961 | 31637.779297 | 2539.899902 | 265.019989 |
| 2021-06-28 | 134.779999 | 3443.889893 | 34434.335938 | 2536.389893 | 268.720001 |
| 2021-06-29 | 136.330002 | 3448.139893 | 35867.777344 | 2520.370117 | 271.399994 |
| 2021-06-30 | 136.960007 | 3440.159912 | 35040.835938 | 2506.320068 | 270.899994 |
#select rows and columns using their integer indexes
data.iloc[:, :]
| AAPL | AMZN | BTC-USD | GOOG | MSFT | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-12-31 | 132.267349 | 3256.929932 | 29001.720703 | 1751.880005 | 221.397675 |
| 2021-01-04 | 128.997803 | 3186.629883 | 31971.914062 | 1728.239990 | 216.689423 |
| 2021-01-05 | 130.592697 | 3218.510010 | 33992.429688 | 1740.920044 | 216.898438 |
| 2021-01-06 | 126.196747 | 3138.379883 | 36824.363281 | 1735.290039 | 211.274414 |
| 2021-01-07 | 130.502991 | 3162.159912 | 39371.042969 | 1787.250000 | 217.286652 |
| ... | ... | ... | ... | ... | ... |
| 2021-06-24 | 133.410004 | 3449.080078 | 34662.437500 | 2545.639893 | 266.690002 |
| 2021-06-25 | 133.110001 | 3401.459961 | 31637.779297 | 2539.899902 | 265.019989 |
| 2021-06-28 | 134.779999 | 3443.889893 | 34434.335938 | 2536.389893 | 268.720001 |
| 2021-06-29 | 136.330002 | 3448.139893 | 35867.777344 | 2520.370117 | 271.399994 |
| 2021-06-30 | 136.960007 | 3440.159912 | 35040.835938 | 2506.320068 | 270.899994 |
125 rows × 5 columns
data.iloc[:5, :]
| AAPL | AMZN | BTC-USD | GOOG | MSFT | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-12-31 | 132.267349 | 3256.929932 | 29001.720703 | 1751.880005 | 221.397675 |
| 2021-01-04 | 128.997803 | 3186.629883 | 31971.914062 | 1728.239990 | 216.689423 |
| 2021-01-05 | 130.592697 | 3218.510010 | 33992.429688 | 1740.920044 | 216.898438 |
| 2021-01-06 | 126.196747 | 3138.379883 | 36824.363281 | 1735.290039 | 211.274414 |
| 2021-01-07 | 130.502991 | 3162.159912 | 39371.042969 | 1787.250000 | 217.286652 |
data.iloc[-5:, :]
| AAPL | AMZN | BTC-USD | GOOG | MSFT | |
|---|---|---|---|---|---|
| Date | |||||
| 2021-06-24 | 133.410004 | 3449.080078 | 34662.437500 | 2545.639893 | 266.690002 |
| 2021-06-25 | 133.110001 | 3401.459961 | 31637.779297 | 2539.899902 | 265.019989 |
| 2021-06-28 | 134.779999 | 3443.889893 | 34434.335938 | 2536.389893 | 268.720001 |
| 2021-06-29 | 136.330002 | 3448.139893 | 35867.777344 | 2520.370117 | 271.399994 |
| 2021-06-30 | 136.960007 | 3440.159912 | 35040.835938 | 2506.320068 | 270.899994 |
data.iloc[:, 3:]
| GOOG | MSFT | |
|---|---|---|
| Date | ||
| 2020-12-31 | 1751.880005 | 221.397675 |
| 2021-01-04 | 1728.239990 | 216.689423 |
| 2021-01-05 | 1740.920044 | 216.898438 |
| 2021-01-06 | 1735.290039 | 211.274414 |
| 2021-01-07 | 1787.250000 | 217.286652 |
| ... | ... | ... |
| 2021-06-24 | 2545.639893 | 266.690002 |
| 2021-06-25 | 2539.899902 | 265.019989 |
| 2021-06-28 | 2536.389893 | 268.720001 |
| 2021-06-29 | 2520.370117 | 271.399994 |
| 2021-06-30 | 2506.320068 | 270.899994 |
125 rows × 2 columns
data.iloc[:, -1:]
| MSFT | |
|---|---|
| Date | |
| 2020-12-31 | 221.397675 |
| 2021-01-04 | 216.689423 |
| 2021-01-05 | 216.898438 |
| 2021-01-06 | 211.274414 |
| 2021-01-07 | 217.286652 |
| ... | ... |
| 2021-06-24 | 266.690002 |
| 2021-06-25 | 265.019989 |
| 2021-06-28 | 268.720001 |
| 2021-06-29 | 271.399994 |
| 2021-06-30 | 270.899994 |
125 rows × 1 columns
#select certain columns from the dataframe
data[['AAPL','AMZN']]
| AAPL | AMZN | |
|---|---|---|
| Date | ||
| 2020-12-31 | 132.267349 | 3256.929932 |
| 2021-01-04 | 128.997803 | 3186.629883 |
| 2021-01-05 | 130.592697 | 3218.510010 |
| 2021-01-06 | 126.196747 | 3138.379883 |
| 2021-01-07 | 130.502991 | 3162.159912 |
| ... | ... | ... |
| 2021-06-24 | 133.410004 | 3449.080078 |
| 2021-06-25 | 133.110001 | 3401.459961 |
| 2021-06-28 | 134.779999 | 3443.889893 |
| 2021-06-29 | 136.330002 | 3448.139893 |
| 2021-06-30 | 136.960007 | 3440.159912 |
125 rows × 2 columns
# create new dataframe from the selected columns
df2 = data[['AAPL','AMZN']]
# rename columns
df3 = df2.rename(columns={'AAPL': 'apple','AMZN': 'Jeff Bezos Shopping Centre'})
df3
| apple | Jeff Bezos Shopping Centre | |
|---|---|---|
| Date | ||
| 2020-12-31 | 132.267349 | 3256.929932 |
| 2021-01-04 | 128.997803 | 3186.629883 |
| 2021-01-05 | 130.592697 | 3218.510010 |
| 2021-01-06 | 126.196747 | 3138.379883 |
| 2021-01-07 | 130.502991 | 3162.159912 |
| ... | ... | ... |
| 2021-06-24 | 133.410004 | 3449.080078 |
| 2021-06-25 | 133.110001 | 3401.459961 |
| 2021-06-28 | 134.779999 | 3443.889893 |
| 2021-06-29 | 136.330002 | 3448.139893 |
| 2021-06-30 | 136.960007 | 3440.159912 |
125 rows × 2 columns
data['AMZN_Discount'] = data['AMZN'] * 0.9 # column multiplication
data
| AAPL | AMZN | BTC-USD | GOOG | MSFT | AMZN_Discount | |
|---|---|---|---|---|---|---|
| Date | ||||||
| 2020-12-31 | 132.267349 | 3256.929932 | 29001.720703 | 1751.880005 | 221.397675 | 2931.236938 |
| 2021-01-04 | 128.997803 | 3186.629883 | 31971.914062 | 1728.239990 | 216.689423 | 2867.966895 |
| 2021-01-05 | 130.592697 | 3218.510010 | 33992.429688 | 1740.920044 | 216.898438 | 2896.659009 |
| 2021-01-06 | 126.196747 | 3138.379883 | 36824.363281 | 1735.290039 | 211.274414 | 2824.541895 |
| 2021-01-07 | 130.502991 | 3162.159912 | 39371.042969 | 1787.250000 | 217.286652 | 2845.943921 |
| ... | ... | ... | ... | ... | ... | ... |
| 2021-06-24 | 133.410004 | 3449.080078 | 34662.437500 | 2545.639893 | 266.690002 | 3104.172070 |
| 2021-06-25 | 133.110001 | 3401.459961 | 31637.779297 | 2539.899902 | 265.019989 | 3061.313965 |
| 2021-06-28 | 134.779999 | 3443.889893 | 34434.335938 | 2536.389893 | 268.720001 | 3099.500903 |
| 2021-06-29 | 136.330002 | 3448.139893 | 35867.777344 | 2520.370117 | 271.399994 | 3103.325903 |
| 2021-06-30 | 136.960007 | 3440.159912 | 35040.835938 | 2506.320068 | 270.899994 | 3096.143921 |
125 rows × 6 columns
data['BTC/AMZN'] = data['BTC-USD'] / data['AMZN']
data
| AAPL | AMZN | BTC-USD | GOOG | MSFT | AMZN_Discount | BTC/AMZN | |
|---|---|---|---|---|---|---|---|
| Date | |||||||
| 2020-12-31 | 132.267349 | 3256.929932 | 29001.720703 | 1751.880005 | 221.397675 | 2931.236938 | 8.904619 |
| 2021-01-04 | 128.997803 | 3186.629883 | 31971.914062 | 1728.239990 | 216.689423 | 2867.966895 | 10.033143 |
| 2021-01-05 | 130.592697 | 3218.510010 | 33992.429688 | 1740.920044 | 216.898438 | 2896.659009 | 10.561542 |
| 2021-01-06 | 126.196747 | 3138.379883 | 36824.363281 | 1735.290039 | 211.274414 | 2824.541895 | 11.733558 |
| 2021-01-07 | 130.502991 | 3162.159912 | 39371.042969 | 1787.250000 | 217.286652 | 2845.943921 | 12.450681 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-06-24 | 133.410004 | 3449.080078 | 34662.437500 | 2545.639893 | 266.690002 | 3104.172070 | 10.049763 |
| 2021-06-25 | 133.110001 | 3401.459961 | 31637.779297 | 2539.899902 | 265.019989 | 3061.313965 | 9.301235 |
| 2021-06-28 | 134.779999 | 3443.889893 | 34434.335938 | 2536.389893 | 268.720001 | 3099.500903 | 9.998675 |
| 2021-06-29 | 136.330002 | 3448.139893 | 35867.777344 | 2520.370117 | 271.399994 | 3103.325903 | 10.402066 |
| 2021-06-30 | 136.960007 | 3440.159912 | 35040.835938 | 2506.320068 | 270.899994 | 3096.143921 | 10.185816 |
125 rows × 7 columns
data['Index'] = range(0,len(data))
data2 = data.set_index('Index')
data
| AAPL | AMZN | BTC-USD | GOOG | MSFT | AMZN_Discount | BTC/AMZN | Index | |
|---|---|---|---|---|---|---|---|---|
| Date | ||||||||
| 2020-12-31 | 132.267349 | 3256.929932 | 29001.720703 | 1751.880005 | 221.397675 | 2931.236938 | 8.904619 | 0 |
| 2021-01-04 | 128.997803 | 3186.629883 | 31971.914062 | 1728.239990 | 216.689423 | 2867.966895 | 10.033143 | 1 |
| 2021-01-05 | 130.592697 | 3218.510010 | 33992.429688 | 1740.920044 | 216.898438 | 2896.659009 | 10.561542 | 2 |
| 2021-01-06 | 126.196747 | 3138.379883 | 36824.363281 | 1735.290039 | 211.274414 | 2824.541895 | 11.733558 | 3 |
| 2021-01-07 | 130.502991 | 3162.159912 | 39371.042969 | 1787.250000 | 217.286652 | 2845.943921 | 12.450681 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-06-24 | 133.410004 | 3449.080078 | 34662.437500 | 2545.639893 | 266.690002 | 3104.172070 | 10.049763 | 120 |
| 2021-06-25 | 133.110001 | 3401.459961 | 31637.779297 | 2539.899902 | 265.019989 | 3061.313965 | 9.301235 | 121 |
| 2021-06-28 | 134.779999 | 3443.889893 | 34434.335938 | 2536.389893 | 268.720001 | 3099.500903 | 9.998675 | 122 |
| 2021-06-29 | 136.330002 | 3448.139893 | 35867.777344 | 2520.370117 | 271.399994 | 3103.325903 | 10.402066 | 123 |
| 2021-06-30 | 136.960007 | 3440.159912 | 35040.835938 | 2506.320068 | 270.899994 | 3096.143921 | 10.185816 | 124 |
125 rows × 8 columns
data.set_index('Index', inplace=True)
data
| AAPL | AMZN | BTC-USD | GOOG | MSFT | AMZN_Discount | BTC/AMZN | |
|---|---|---|---|---|---|---|---|
| Index | |||||||
| 0 | 132.267349 | 3256.929932 | 29001.720703 | 1751.880005 | 221.397675 | 2931.236938 | 8.904619 |
| 1 | 128.997803 | 3186.629883 | 31971.914062 | 1728.239990 | 216.689423 | 2867.966895 | 10.033143 |
| 2 | 130.592697 | 3218.510010 | 33992.429688 | 1740.920044 | 216.898438 | 2896.659009 | 10.561542 |
| 3 | 126.196747 | 3138.379883 | 36824.363281 | 1735.290039 | 211.274414 | 2824.541895 | 11.733558 |
| 4 | 130.502991 | 3162.159912 | 39371.042969 | 1787.250000 | 217.286652 | 2845.943921 | 12.450681 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 120 | 133.410004 | 3449.080078 | 34662.437500 | 2545.639893 | 266.690002 | 3104.172070 | 10.049763 |
| 121 | 133.110001 | 3401.459961 | 31637.779297 | 2539.899902 | 265.019989 | 3061.313965 | 9.301235 |
| 122 | 134.779999 | 3443.889893 | 34434.335938 | 2536.389893 | 268.720001 | 3099.500903 | 9.998675 |
| 123 | 136.330002 | 3448.139893 | 35867.777344 | 2520.370117 | 271.399994 | 3103.325903 | 10.402066 |
| 124 | 136.960007 | 3440.159912 | 35040.835938 | 2506.320068 | 270.899994 | 3096.143921 | 10.185816 |
125 rows × 7 columns
data['AMZN'].mean()
3248.419505859375
data['AMZN'].std()
128.8898077172711
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 125 entries, 0 to 124 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 AAPL 125 non-null float64 1 AMZN 125 non-null float64 2 BTC-USD 125 non-null float64 3 GOOG 125 non-null float64 4 MSFT 125 non-null float64 5 AMZN_Discount 125 non-null float64 6 BTC/AMZN 125 non-null float64 dtypes: float64(7) memory usage: 7.8 KB
data.describe()
| AAPL | AMZN | BTC-USD | GOOG | MSFT | AMZN_Discount | BTC/AMZN | |
|---|---|---|---|---|---|---|---|
| count | 125.000000 | 125.000000 | 125.000000 | 125.000000 | 125.000000 | 125.000000 | 125.000000 |
| mean | 128.812221 | 3248.419506 | 45987.991172 | 2183.645197 | 242.564961 | 2923.577555 | 14.205109 |
| std | 5.436991 | 128.889808 | 9919.581286 | 234.638402 | 14.188132 | 116.000827 | 3.212776 |
| min | 116.162689 | 2951.949951 | 29001.720703 | 1728.239990 | 211.274414 | 2656.754956 | 8.904619 |
| 25% | 125.059998 | 3151.939941 | 36825.367188 | 2045.060059 | 233.241425 | 2836.745947 | 11.312226 |
| 50% | 128.499405 | 3249.899902 | 47504.851562 | 2137.750000 | 243.119995 | 2924.909912 | 14.982480 |
| 75% | 133.253662 | 3340.879883 | 55724.265625 | 2381.350098 | 252.666565 | 3006.791895 | 17.095260 |
| max | 142.704010 | 3505.439941 | 63503.457031 | 2545.639893 | 271.399994 | 3154.895947 | 19.283829 |
data.corr()
| AAPL | AMZN | BTC-USD | GOOG | MSFT | AMZN_Discount | BTC/AMZN | |
|---|---|---|---|---|---|---|---|
| AAPL | 1.000000 | 0.757931 | -0.350447 | 0.042839 | 0.300985 | 0.757931 | -0.477725 |
| AMZN | 0.757931 | 1.000000 | -0.208388 | 0.577279 | 0.737247 | 1.000000 | -0.380158 |
| BTC-USD | -0.350447 | -0.208388 | 1.000000 | 0.066363 | 0.114908 | -0.208388 | 0.983020 |
| GOOG | 0.042839 | 0.577279 | 0.066363 | 1.000000 | 0.914961 | 0.577279 | -0.033493 |
| MSFT | 0.300985 | 0.737247 | 0.114908 | 0.914961 | 1.000000 | 0.737247 | -0.017608 |
| AMZN_Discount | 0.757931 | 1.000000 | -0.208388 | 0.577279 | 0.737247 | 1.000000 | -0.380158 |
| BTC/AMZN | -0.477725 | -0.380158 | 0.983020 | -0.033493 | -0.017608 | -0.380158 | 1.000000 |
data.cov()
| AAPL | AMZN | BTC-USD | GOOG | MSFT | AMZN_Discount | BTC/AMZN | |
|---|---|---|---|---|---|---|---|
| AAPL | 29.560876 | 531.137380 | -1.890052e+04 | 54.651372 | 23.218203 | 478.023642 | -8.344817 |
| AMZN | 531.137380 | 16612.582533 | -2.664314e+05 | 17458.366485 | 1348.208052 | 14951.324280 | -157.421207 |
| BTC-USD | -18900.522395 | -266431.381258 | 9.839809e+07 | 154461.986254 | 16172.239860 | -239788.243132 | 31328.256836 |
| GOOG | 54.651372 | 17458.366485 | 1.544620e+05 | 55055.179638 | 3045.977453 | 15712.529837 | -25.248392 |
| MSFT | 23.218203 | 1348.208052 | 1.617224e+04 | 3045.977453 | 201.303079 | 1213.387247 | -0.802630 |
| AMZN_Discount | 478.023642 | 14951.324280 | -2.397882e+05 | 15712.529837 | 1213.387247 | 13456.191852 | -141.679086 |
| BTC/AMZN | -8.344817 | -157.421207 | 3.132826e+04 | -25.248392 | -0.802630 | -141.679086 | 10.321928 |
data.corr() * data.cov()
| AAPL | AMZN | BTC-USD | GOOG | MSFT | AMZN_Discount | BTC/AMZN | |
|---|---|---|---|---|---|---|---|
| AAPL | 29.560876 | 402.565455 | 6.623623e+03 | 2.341232 | 6.988329 | 362.308909 | 3.986526 |
| AMZN | 402.565455 | 16612.582533 | 5.552120e+04 | 10078.352498 | 993.962611 | 14951.324280 | 59.844948 |
| BTC-USD | 6623.623247 | 55521.198989 | 9.839809e+07 | 10250.635665 | 1858.325550 | 49969.079090 | 30796.312163 |
| GOOG | 2.341232 | 10078.352498 | 1.025064e+04 | 55055.179638 | 2786.949294 | 9070.517249 | 0.845645 |
| MSFT | 6.988329 | 993.962611 | 1.858326e+03 | 2786.949294 | 201.303079 | 894.566350 | 0.014133 |
| AMZN_Discount | 362.308909 | 14951.324280 | 4.996908e+04 | 9070.517249 | 894.566350 | 13456.191852 | 53.860453 |
| BTC/AMZN | 3.986526 | 59.844948 | 3.079631e+04 | 0.845645 | 0.014133 | 53.860453 | 10.321928 |
pd.concat([data.corr(),data.cov()], axis=0)
| AAPL | AMZN | BTC-USD | GOOG | MSFT | AMZN_Discount | BTC/AMZN | |
|---|---|---|---|---|---|---|---|
| AAPL | 1.000000 | 0.757931 | -3.504466e-01 | 0.042839 | 0.300985 | 0.757931 | -0.477725 |
| AMZN | 0.757931 | 1.000000 | -2.083884e-01 | 0.577279 | 0.737247 | 1.000000 | -0.380158 |
| BTC-USD | -0.350447 | -0.208388 | 1.000000e+00 | 0.066363 | 0.114908 | -0.208388 | 0.983020 |
| GOOG | 0.042839 | 0.577279 | 6.636348e-02 | 1.000000 | 0.914961 | 0.577279 | -0.033493 |
| MSFT | 0.300985 | 0.737247 | 1.149084e-01 | 0.914961 | 1.000000 | 0.737247 | -0.017608 |
| AMZN_Discount | 0.757931 | 1.000000 | -2.083884e-01 | 0.577279 | 0.737247 | 1.000000 | -0.380158 |
| BTC/AMZN | -0.477725 | -0.380158 | 9.830203e-01 | -0.033493 | -0.017608 | -0.380158 | 1.000000 |
| AAPL | 29.560876 | 531.137380 | -1.890052e+04 | 54.651372 | 23.218203 | 478.023642 | -8.344817 |
| AMZN | 531.137380 | 16612.582533 | -2.664314e+05 | 17458.366485 | 1348.208052 | 14951.324280 | -157.421207 |
| BTC-USD | -18900.522395 | -266431.381258 | 9.839809e+07 | 154461.986254 | 16172.239860 | -239788.243132 | 31328.256836 |
| GOOG | 54.651372 | 17458.366485 | 1.544620e+05 | 55055.179638 | 3045.977453 | 15712.529837 | -25.248392 |
| MSFT | 23.218203 | 1348.208052 | 1.617224e+04 | 3045.977453 | 201.303079 | 1213.387247 | -0.802630 |
| AMZN_Discount | 478.023642 | 14951.324280 | -2.397882e+05 | 15712.529837 | 1213.387247 | 13456.191852 | -141.679086 |
| BTC/AMZN | -8.344817 | -157.421207 | 3.132826e+04 | -25.248392 | -0.802630 | -141.679086 | 10.321928 |
pd.concat([data.corr(),data.cov()], axis=1)
| AAPL | AMZN | BTC-USD | GOOG | MSFT | AMZN_Discount | BTC/AMZN | AAPL | AMZN | BTC-USD | GOOG | MSFT | AMZN_Discount | BTC/AMZN | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAPL | 1.000000 | 0.757931 | -0.350447 | 0.042839 | 0.300985 | 0.757931 | -0.477725 | 29.560876 | 531.137380 | -1.890052e+04 | 54.651372 | 23.218203 | 478.023642 | -8.344817 |
| AMZN | 0.757931 | 1.000000 | -0.208388 | 0.577279 | 0.737247 | 1.000000 | -0.380158 | 531.137380 | 16612.582533 | -2.664314e+05 | 17458.366485 | 1348.208052 | 14951.324280 | -157.421207 |
| BTC-USD | -0.350447 | -0.208388 | 1.000000 | 0.066363 | 0.114908 | -0.208388 | 0.983020 | -18900.522395 | -266431.381258 | 9.839809e+07 | 154461.986254 | 16172.239860 | -239788.243132 | 31328.256836 |
| GOOG | 0.042839 | 0.577279 | 0.066363 | 1.000000 | 0.914961 | 0.577279 | -0.033493 | 54.651372 | 17458.366485 | 1.544620e+05 | 55055.179638 | 3045.977453 | 15712.529837 | -25.248392 |
| MSFT | 0.300985 | 0.737247 | 0.114908 | 0.914961 | 1.000000 | 0.737247 | -0.017608 | 23.218203 | 1348.208052 | 1.617224e+04 | 3045.977453 | 201.303079 | 1213.387247 | -0.802630 |
| AMZN_Discount | 0.757931 | 1.000000 | -0.208388 | 0.577279 | 0.737247 | 1.000000 | -0.380158 | 478.023642 | 14951.324280 | -2.397882e+05 | 15712.529837 | 1213.387247 | 13456.191852 | -141.679086 |
| BTC/AMZN | -0.477725 | -0.380158 | 0.983020 | -0.033493 | -0.017608 | -0.380158 | 1.000000 | -8.344817 | -157.421207 | 3.132826e+04 | -25.248392 | -0.802630 | -141.679086 | 10.321928 |
data['BTC-USD'].plot(kind='line')
<AxesSubplot:xlabel='Index'>
data['BTC_MA'] = data['BTC-USD'].rolling(10).mean()
data[['BTC-USD','BTC_MA']].plot(kind='line')
<AxesSubplot:xlabel='Index'>
from sklearn.metrics import r2_score
r2_score(data['BTC-USD'].iloc[10:], data['BTC_MA'].iloc[10:])
0.8184272450110561
import plotly.express as px
volatility = pd.DataFrame(data[stocks].std(), columns = ['Volatility'])
volatility['Stock'] = volatility.index
volatility
| Volatility | Stock | |
|---|---|---|
| AMZN | 128.889808 | AMZN |
| AAPL | 5.436991 | AAPL |
| MSFT | 14.188132 | MSFT |
| GOOG | 234.638402 | GOOG |
| BTC-USD | 9919.581286 | BTC-USD |
fig = px.bar(volatility, x='Stock', y='Volatility', color = 'Stock')
fig.show()
fig = px.scatter(data, x="AAPL", y="MSFT")
fig.show()
fig = px.scatter(data, x="AAPL", y="MSFT", trendline="ols")
fig.show()
results = px.get_trendline_results(fig)
results.px_fit_results.iloc[0].summary()
| Dep. Variable: | y | R-squared: | 0.091 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.083 |
| Method: | Least Squares | F-statistic: | 12.25 |
| Date: | Sun, 04 Jul 2021 | Prob (F-statistic): | 0.000648 |
| Time: | 16:55:33 | Log-Likelihood: | -502.48 |
| No. Observations: | 125 | AIC: | 1009. |
| Df Residuals: | 123 | BIC: | 1015. |
| Df Model: | 1 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 141.3911 | 28.929 | 4.888 | 0.000 | 84.128 | 198.654 |
| x1 | 0.7854 | 0.224 | 3.500 | 0.001 | 0.341 | 1.230 |
| Omnibus: | 8.628 | Durbin-Watson: | 1.687 |
|---|---|---|---|
| Prob(Omnibus): | 0.013 | Jarque-Bera (JB): | 9.160 |
| Skew: | -0.643 | Prob(JB): | 0.0103 |
| Kurtosis: | 2.675 | Cond. No. | 3.07e+03 |
fig = px.histogram(data, x="BTC-USD")
fig.show()
df = px.data.tips()
df
| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
| 243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
fig = px.histogram(df, x="total_bill", color="sex")
fig.show()
df = px.data.iris()
df
| sepal_length | sepal_width | petal_length | petal_width | species | species_id | |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | 3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | 3 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | 3 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 3 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica | 3 |
150 rows × 6 columns
fig = px.scatter_3d(df, x='sepal_length', y='sepal_width', z='petal_width',
color='species')
fig.show()
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.svm import SVR #(support vector machines)
mesh_size = .02
margin = 0
df = px.data.iris()
X = df[['sepal_width', 'sepal_length']]
y = df['petal_width']
# Condition the model on sepal width and length, predict the petal width
model = SVR(C=1.)
model.fit(X, y)
# Create a mesh grid on which we will run our model
x_min, x_max = X.sepal_width.min() - margin, X.sepal_width.max() + margin
y_min, y_max = X.sepal_length.min() - margin, X.sepal_length.max() + margin
xrange = np.arange(x_min, x_max, mesh_size)
yrange = np.arange(y_min, y_max, mesh_size)
xx, yy = np.meshgrid(xrange, yrange)
# Run model
pred = model.predict(np.c_[xx.ravel(), yy.ravel()])
pred = pred.reshape(xx.shape)
# Generate the plot
fig = px.scatter_3d(df, x='sepal_width', y='sepal_length', z='petal_width', color='species')
fig.update_traces(marker=dict(size=5))
fig.add_traces(go.Surface(x=xrange, y=yrange, z=pred, name='pred_surface'))
fig.show()
df1 = pd.read_excel('phd_count_2000-2017.xlsx')
D:\anaconda3\envs\Binance\lib\site-packages\openpyxl\styles\stylesheet.py:221: UserWarning: Workbook contains no default style, apply openpyxl's default
columns = [year[:4] for year in df1.iloc[1:2,1:].values[0]]
df_new = df1.iloc[2:,:].dropna()
df_new.columns = ['Date']+columns
df_t = df_new.T
df_t.columns = df_t.iloc[:1,:].values[0]
df_final_2000_2016 = df_t.iloc[1:,:]
def clean_data(df1):
columns = [year[:4] for year in df1.iloc[1:2,1:].values[0]]
df_new = df1.iloc[2:,:].dropna()
df_new.columns = ['Date']+columns
df_t = df_new.T
df_t.columns = df_t.iloc[:1,:].values[0]
df_final_2000_2016 = df_t.iloc[1:,:]
return df_final_2000_2016
df2 = pd.read_excel('phd_count_2017-2021.xlsx')
D:\anaconda3\envs\Binance\lib\site-packages\openpyxl\styles\stylesheet.py:221: UserWarning: Workbook contains no default style, apply openpyxl's default
columns = [year[:4] for year in df2.iloc[1:2,1:].values[0]]
df_new = df2.iloc[2:,:].dropna()
df_new.columns = ['Date']+columns
df_t = df_new.T
df_t.columns = df_t.iloc[:1,:].values[0]
df_final_2017_2020 = df_t.iloc[1:,:]
df_final = pd.concat([df_final_2000_2016, df_final_2017_2020], axis = 0).fillna(0)
df_final.reset_index(drop=False, inplace = True)
px.scatter(df_final, x=df_final.index, y="Общо", trendline = 'ols')
df_final
| index | Общо | Подготовка на учители и науки за образованието | Изкуства | Хуманитарни науки | Науки за обществото и човешкото поведение | Журналистика, масова комуникация и информация | Стопански науки и администрация | Право | Природни науки | ... | Технически науки, производство и строителство, интердисциплинарни програми и квалификации | Селско стопанство | Горско стопанство | Рибно стопанство | Социални услуги | Здравеопазване и социални услуги, интердисциплинарни програми и квалификации | Персонални услуги | Комунално-санитарни услуги, здраве и безопасност при работа | Сигурност | Транспортни услуги | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | 3414 | 160 | 113 | 489 | 143 | 10 | 441 | 59 | 134 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2001 | 3998 | 360 | 155 | 504 | 184 | 7 | 542 | 81 | 174 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2002 | 4440 | 358 | 172 | 560 | 207 | 13 | 607 | 93 | 154 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2003 | 4834 | 270 | 185 | 642 | 262 | 7 | 657 | 158 | 144 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2004 | 5079 | 416 | 205 | 595 | 259 | 15 | 631 | 126 | 150 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 2005 | 5163 | 416 | 204 | 544 | 214 | 18 | 716 | 146 | 161 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 2006 | 4816 | 355 | 195 | 523 | 201 | 22 | 550 | 168 | 212 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 2007 | 4353 | 321 | 149 | 414 | 526 | 75 | 262 | 152 | 214 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 2008 | 3949 | 276 | 165 | 397 | 461 | 63 | 242 | 149 | 194 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 2009 | 3850 | 252 | 173 | 408 | 440 | 59 | 257 | 147 | 175 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10 | 2010 | 4095 | 284 | 165 | 430 | 455 | 88 | 308 | 152 | 180 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11 | 2011 | 4703 | 341 | 260 | 441 | 542 | 109 | 359 | 149 | 190 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12 | 2012 | 5371 | 390 | 274 | 518 | 627 | 112 | 394 | 149 | 225 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13 | 2013 | 6055 | 479 | 278 | 594 | 728 | 136 | 492 | 159 | 245 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14 | 2014 | 6617 | 567 | 354 | 647 | 809 | 141 | 546 | 178 | 270 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15 | 2015 | 6750 | 637 | 329 | 621 | 892 | 147 | 541 | 214 | 256 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 16 | 2016 | 6738 | 612 | 330 | 677 | 946 | 177 | 544 | 205 | 301 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17 | 2017 | 6564 | 0 | 342 | 0 | 964 | 0 | 0 | 222 | 0 | ... | 7 | 101 | 27 | 3 | 23 | 90 | - | 19 | 107 | 45 |
| 18 | 2018 | 6467 | 0 | 357 | 0 | 922 | 0 | 0 | 242 | 0 | ... | 8 | 107 | 29 | 3 | 36 | 92 | - | 24 | 94 | 53 |
| 19 | 2019 | 6440 | 0 | 358 | 0 | 932 | 0 | 0 | 292 | 0 | ... | 8 | 101 | 29 | 2 | 33 | 98 | - | 14 | 122 | 50 |
| 20 | 2020 | 6570 | 0 | 352 | 0 | 901 | 0 | 0 | 336 | 0 | ... | 16 | 97 | 26 | 2 | 35 | 102 | 14 | 26 | 126 | 49 |
21 rows × 43 columns
df3 = pd.read_excel('phd_finish 2000-2017.xlsx')
df3
D:\anaconda3\envs\Binance\lib\site-packages\openpyxl\styles\stylesheet.py:221: UserWarning: Workbook contains no default style, apply openpyxl's default
| ЗАВЪРШИЛИ С ОБРАЗОВАТЕЛНА И НАУЧНА СТЕПЕН „ДОКТОР" ПО ПОЛ И ОБЛАСТ НА ОБРАЗОВАНИЕТО (КОО-2008) - ДАННИ ОТ 2001 ДО 2016 ГОДИНА | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Unnamed: 15 | Unnamed: 16 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | (брой) | (брой) | (брой) | (брой) | (брой) | (брой) | (брой) | (брой) | (брой) | (брой) | (брой) | (брой) | (брой) | (брой) | (брой) | (брой) |
| 1 | NaN | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
| 2 | NaN | Общо | Общо | Общо | Общо | Общо | Общо | Общо | Общо | Общо | Общо | Общо | Общо | Общо | Общо | Общо | Общо |
| 3 | Общо | 376 | 385 | 401 | 392 | 528 | 583 | 621 | 601 | 636 | 596 | 638 | 979 | 1202 | 1363 | 1442 | 1464 |
| 4 | Подготовка на учители и науки за образованието | 32 | 27 | 23 | 38 | 39 | 44 | 48 | 55 | 45 | 32 | 52 | 98 | 106 | 138 | 113 | 136 |
| 5 | Изкуства | 5 | 21 | 12 | 12 | 13 | 29 | 31 | 33 | 30 | 42 | 33 | 72 | 92 | 90 | 89 | 87 |
| 6 | Хуманитарни науки | 61 | 68 | 67 | 61 | 76 | 73 | 61 | 60 | 77 | 68 | 53 | 107 | 106 | 93 | 125 | 146 |
| 7 | Науки за обществото и човешкото поведение | 8 | 13 | 11 | 12 | 22 | 19 | 79 | 68 | 67 | 60 | 78 | 109 | 113 | 149 | 148 | 144 |
| 8 | Журналистика, масова комуникация и информация | - | 1 | - | - | - | - | 6 | 9 | 7 | 6 | 15 | 23 | 22 | 28 | 19 | 18 |
| 9 | Стопански науки и администрация | 45 | 37 | 50 | 50 | 74 | 65 | 32 | 19 | 26 | 38 | 40 | 71 | 95 | 101 | 119 | 95 |
| 10 | Право | 9 | 7 | 3 | 1 | 4 | 15 | 18 | 26 | 19 | 28 | 22 | 37 | 33 | 42 | 40 | 42 |
| 11 | Природни науки | 5 | 21 | 18 | 18 | 17 | 34 | 43 | 36 | 45 | 38 | 47 | 50 | 48 | 71 | 83 | 65 |
| 12 | Физически и химически науки | 46 | 34 | 61 | 46 | 62 | 39 | 51 | 66 | 72 | 37 | 47 | 68 | 73 | 76 | 94 | 91 |
| 13 | Математика и статистика | 17 | 8 | 10 | 13 | 10 | 13 | 11 | 10 | 9 | 9 | 10 | 13 | 19 | 27 | 14 | 20 |
| 14 | Информатика | - | - | - | - | - | - | 5 | 8 | 5 | 7 | 15 | 24 | 25 | 33 | 36 | 18 |
| 15 | Технически науки и технически професии | 50 | 46 | 39 | 55 | 65 | 107 | 95 | 86 | 104 | 112 | 86 | 103 | 157 | 172 | 166 | 161 |
| 16 | Добив и производствени технологии | 3 | 7 | 7 | 11 | 12 | 10 | 17 | 6 | 8 | 10 | 21 | 17 | 31 | 37 | 34 | 31 |
| 17 | Архитектура и строителство | 5 | 6 | 5 | 8 | 8 | 9 | 3 | 11 | 10 | 8 | 12 | 25 | 24 | 40 | 45 | 32 |
| 18 | Аграрни науки, горско стопанство и аквакултури | 24 | 27 | 28 | 9 | 32 | 32 | 23 | 25 | 19 | 15 | 20 | 24 | 27 | 34 | 42 | 35 |
| 19 | Ветеринарна медицина | 3 | 1 | - | 2 | 4 | 5 | 4 | 1 | 7 | 5 | 4 | 8 | 10 | 13 | 19 | 8 |
| 20 | Здравеопазване | 54 | 50 | 55 | 40 | 74 | 77 | 74 | 65 | 63 | 54 | 63 | 91 | 173 | 162 | 198 | 277 |
| 21 | Социални дейности | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | - |
| 22 | Спорт, туризъм и хотелиерство | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | - |
| 23 | Управление на сухопътен, воден и въздушен тран... | 2 | 1 | 3 | 6 | 4 | 5 | - | 1 | - | - | - | 4 | 9 | 11 | 14 | 7 |
| 24 | Опазване на околната среда | 2 | 2 | 3 | 7 | 5 | 5 | 8 | 7 | 11 | 7 | 10 | 19 | 18 | 26 | 18 | 16 |
| 25 | Сигурност и безопасност | 5 | 8 | 6 | 3 | 7 | 2 | 12 | 9 | 12 | 20 | 10 | 16 | 21 | 20 | 26 | 35 |
| 26 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 27 | Бележки | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 28 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 29 | Легенда | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 30 | По колони | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 31 | NaN | Мерни единици | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 32 | NaN | Времева разбивка | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 33 | NaN | Пол | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 34 | По редове | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 35 | NaN | Тесни области на образование | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 36 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 37 | Дата на изготвяне на справката: 03/07/2021 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
df3_clean = clean_data(df3)
df4 = pd.read_excel('phd_finish 2017-2020.xlsx')
df4
D:\anaconda3\envs\Binance\lib\site-packages\openpyxl\styles\stylesheet.py:221: UserWarning: Workbook contains no default style, apply openpyxl's default
| ЗАВЪРШИЛИ С ОБРАЗОВАТЕЛНА И НАУЧНА СТЕПЕН „ДОКТОР" ПО ПОЛ И ОБЛАСТ НА ОБРАЗОВАНИЕТО (КОО-2015) - ДАННИ ОТ 2017 ГОДИНА | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | |
|---|---|---|---|---|---|
| 0 | NaN | (брой) | (брой) | (брой) | (брой) |
| 1 | NaN | 2017 | 2018 | 2019 | 2020 |
| 2 | NaN | Общо | Общо | Общо | Общо |
| 3 | Общо | 1423 | 1365 | 1285 | 1097 |
| 4 | Образование | 111 | 140 | 124 | 112 |
| 5 | Изкуства | 77 | 74 | 75 | 66 |
| 6 | Хуманитарни науки (без езици) | 64 | 62 | 75 | 51 |
| 7 | Езици | 65 | 42 | 56 | 39 |
| 8 | Науки за обществото и човешкото поведение | 181 | 183 | 184 | 157 |
| 9 | Журналистика и информация | 35 | 53 | 48 | 49 |
| 10 | Бизнес и администрация | 104 | 129 | 112 | 92 |
| 11 | Право | 37 | 41 | 44 | 32 |
| 12 | Биологически науки и сродни на тях | 77 | 61 | 56 | 52 |
| 13 | Околна среда | 9 | 5 | 9 | 10 |
| 14 | Физически, химически и науки за Земята | 115 | 84 | 71 | 59 |
| 15 | Математика и статистика | 14 | 21 | 14 | 7 |
| 16 | Информационно-комуникационни технологии | 39 | 27 | 23 | 20 |
| 17 | Технически науки и технически професии | 127 | 116 | 118 | 77 |
| 18 | Добив и производствени технологии | 33 | 21 | 16 | 14 |
| 19 | Архитектура и строителство | 33 | 16 | 18 | 15 |
| 20 | Технически науки, производство и строителство,... | - | - | 1 | - |
| 21 | Селско стопанство | 29 | 22 | 27 | 12 |
| 22 | Горско стопанство | 7 | 4 | 3 | 7 |
| 23 | Рибно стопанство | - | - | 1 | 1 |
| 24 | Ветеринарна медицина | 8 | 8 | 7 | 5 |
| 25 | Здравеопазване | 173 | 163 | 137 | 156 |
| 26 | Социални услуги | - | 3 | 4 | 8 |
| 27 | Здравеопазване и социални услуги, интердисципл... | 39 | 35 | 28 | 21 |
| 28 | Комунално-санитарни услуги, здраве и безопасно... | 9 | 7 | - | 9 |
| 29 | Сигурност | 23 | 43 | 28 | 22 |
| 30 | Транспортни услуги | 14 | 5 | 6 | 4 |
| 31 | NaN | NaN | NaN | NaN | NaN |
| 32 | Бележки | NaN | NaN | NaN | NaN |
| 33 | NaN | NaN | NaN | NaN | NaN |
| 34 | Легенда | NaN | NaN | NaN | NaN |
| 35 | По колони | NaN | NaN | NaN | NaN |
| 36 | NaN | Мерни единици | NaN | NaN | NaN |
| 37 | NaN | Времева разбивка | NaN | NaN | NaN |
| 38 | NaN | Пол | NaN | NaN | NaN |
| 39 | По редове | NaN | NaN | NaN | NaN |
| 40 | NaN | Тесни области на образование | NaN | NaN | NaN |
| 41 | NaN | NaN | NaN | NaN | NaN |
| 42 | Дата на изготвяне на справката: 03/07/2021 | NaN | NaN | NaN | NaN |
df4_clean = clean_data(df4)
df_final_finish = pd.concat([df3_clean, df4_clean], axis = 0).fillna(0)
df_final_finish.reset_index(drop=False, inplace = True)
df_final.columns = [str(column) + "_Start" for column in df_final.columns]
df_final_finish.columns = [str(column) + "_End" for column in df_final_finish.columns]
df_final['index_End'] = (df_final['index_Start'].astype(int) + 3).astype(str)
df_merge = pd.merge(df_final,df_final_finish, on = 'index_End', how = 'left')
model_data = df_merge[['index_Start', 'Общо_Start', 'Общо_End']].set_index('index_Start')
model_data
| Общо_Start | Общо_End | |
|---|---|---|
| index_Start | ||
| 2000 | 3414 | 401.0 |
| 2001 | 3998 | 392.0 |
| 2002 | 4440 | 528.0 |
| 2003 | 4834 | 583.0 |
| 2004 | 5079 | 621.0 |
| 2005 | 5163 | 601.0 |
| 2006 | 4816 | 636.0 |
| 2007 | 4353 | 596.0 |
| 2008 | 3949 | 638.0 |
| 2009 | 3850 | 979.0 |
| 2010 | 4095 | 1202.0 |
| 2011 | 4703 | 1363.0 |
| 2012 | 5371 | 1442.0 |
| 2013 | 6055 | 1464.0 |
| 2014 | 6617 | 1423.0 |
| 2015 | 6750 | 1365.0 |
| 2016 | 6738 | 1285.0 |
| 2017 | 6564 | 1097.0 |
| 2018 | 6467 | NaN |
| 2019 | 6440 | NaN |
| 2020 | 6570 | NaN |
model_data[:'2017'].corr()
| Общо_Start | Общо_End | |
|---|---|---|
| Общо_Start | 1.000000 | 0.665142 |
| Общо_End | 0.665142 | 1.000000 |
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=model_data[:'2017'].index, y=model_data[:'2017']['Общо_Start'],
mode='lines',
name='lines'))
fig.add_trace(go.Scatter(x=model_data[:'2017'].index, y=model_data[:'2017']['Общо_End'],
mode='lines',
name='lines'))
train,test = model_data[:'2017'],model_data['2018':]
X_train,Y_train, X_test, Y_test = train['Общо_Start'],train['Общо_End'],test['Общо_Start'],test['Общо_End']
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train.values.reshape(-1, 1), Y_train.values.reshape(-1, 1))
reg.score(X_train.values.reshape(-1, 1), Y_train.values.reshape(-1, 1))
0.4424134969996403
y = df_merge['Общо_End']
X = df_merge.filter(regex='Start').drop(columns = ['Общо_Start'])
X = X.replace("-",0).astype(int)
y = y.replace("-",0)
X_train,X_test = X[:-3],X[-3:]
Y_train,Y_test = y[:-3],y[-3:]
reg = LinearRegression(normalize=True,
positive = True,
fit_intercept = True).fit(X_train.values, Y_train.values.reshape(-1, 1))
reg.score(X_train.values, Y_train.values.reshape(-1, 1))
0.9694366979430711
from sklearn.metrics import mean_absolute_percentage_error
mean_absolute_percentage_error(Y_train,reg.predict(X_train.values))
0.06466520656011816
reg.coef_[0]
array([3.94433539e+01, 0.00000000e+00, 0.00000000e+00, 2.08887617e-01,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 4.09839513e+00, 0.00000000e+00, 0.00000000e+00,
4.17794344e+00, 0.00000000e+00, 0.00000000e+00, 2.20996276e+02,
5.52936393e+00, 0.00000000e+00, 8.85589039e-02, 5.34421416e-01,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00])
# get importance
importance = reg.coef_[0]
fig = go.Figure([go.Bar(x=X_train.columns, y=reg.coef_[0])])
fig.update_yaxes(type="log")
fig.show()
fig = go.Figure()
fig.add_trace(go.Scatter(x=X_train['index_Start'], y=Y_train,
mode='lines',
name='X_train'))
fig.add_trace(go.Scatter(x=X_train['index_Start'], y=[x[0] for x in reg.predict(X_train.values)],
mode='lines',
name='X_predicted'))
reg.predict(X_test)
array([[1106.47999546],
[1090.18833375],
[1158.70031552]])
reg.predict(X_test).sum() / model_data.iloc[-3:]['Общо_Start'].values.sum()
0.1722733811537326
import statsmodels.api as sm
model = sm.OLS(Y_train, X_train)
results = model.fit(normalize=True,
positive = True,
fit_intercept = True)
results.summary()
D:\anaconda3\envs\Binance\lib\site-packages\scipy\stats\stats.py:1604: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=18 D:\anaconda3\envs\Binance\lib\site-packages\statsmodels\regression\linear_model.py:1728: RuntimeWarning: divide by zero encountered in true_divide D:\anaconda3\envs\Binance\lib\site-packages\statsmodels\regression\linear_model.py:1729: RuntimeWarning: invalid value encountered in double_scalars D:\anaconda3\envs\Binance\lib\site-packages\statsmodels\regression\linear_model.py:1650: RuntimeWarning: divide by zero encountered in double_scalars D:\anaconda3\envs\Binance\lib\site-packages\statsmodels\base\model.py:1452: RuntimeWarning: invalid value encountered in multiply
| Dep. Variable: | Общо_End | R-squared: | 1.000 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | nan |
| Method: | Least Squares | F-statistic: | nan |
| Date: | Sun, 04 Jul 2021 | Prob (F-statistic): | nan |
| Time: | 16:55:35 | Log-Likelihood: | 461.93 |
| No. Observations: | 18 | AIC: | -887.9 |
| Df Residuals: | 0 | BIC: | -871.8 |
| Df Model: | 17 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| index_Start | -0.0320 | inf | -0 | nan | nan | nan |
| Подготовка на учители и науки за образованието_Start | 0.4332 | inf | 0 | nan | nan | nan |
| Изкуства_Start | 1.0899 | inf | 0 | nan | nan | nan |
| Хуманитарни науки_Start | 1.6165 | inf | 0 | nan | nan | nan |
| Науки за обществото и човешкото поведение_Start | -2.5177 | inf | -0 | nan | nan | nan |
| Журналистика, масова комуникация и информация_Start | 4.6852 | inf | 0 | nan | nan | nan |
| Стопански науки и администрация_Start | 0.5146 | inf | 0 | nan | nan | nan |
| Право_Start | 1.9937 | inf | 0 | nan | nan | nan |
| Природни науки_Start | -1.6375 | inf | -0 | nan | nan | nan |
| Физически и химически науки_Start | -1.8949 | inf | -0 | nan | nan | nan |
| Математика и статистика_Start | -4.3385 | inf | -0 | nan | nan | nan |
| Информатика_Start | 0.3657 | inf | 0 | nan | nan | nan |
| Технически науки и технически професии_Start | 0.5425 | inf | 0 | nan | nan | nan |
| Добив и производствени технологии_Start | 7.5148 | inf | 0 | nan | nan | nan |
| Архитектура и строителство_Start | 0.7051 | inf | 0 | nan | nan | nan |
| Аграрни науки, горско стопанство и аквакултури_Start | -4.5611 | inf | -0 | nan | nan | nan |
| Ветеринарна медицина_Start | 0.4641 | inf | 0 | nan | nan | nan |
| Здравеопазване_Start | 1.1526 | inf | 0 | nan | nan | nan |
| Социални дейности_Start | -0.3244 | inf | -0 | nan | nan | nan |
| Спорт, туризъм и хотелиерство_Start | 0.1357 | inf | 0 | nan | nan | nan |
| Управление на сухопътен, воден и въздушен транспорт_Start | 3.9745 | inf | 0 | nan | nan | nan |
| Опазване на околната среда_Start | -1.7527 | inf | -0 | nan | nan | nan |
| Сигурност и безопасност_Start | 1.9679 | inf | 0 | nan | nan | nan |
| Образование_Start | 0.6194 | inf | 0 | nan | nan | nan |
| Хуманитарни науки (без езици)_Start | 0.4031 | inf | 0 | nan | nan | nan |
| Езици_Start | 0.2749 | inf | 0 | nan | nan | nan |
| Журналистика и информация_Start | 0.2054 | inf | 0 | nan | nan | nan |
| Бизнес и администрация_Start | 0.6465 | inf | 0 | nan | nan | nan |
| Биологически науки и сродни на тях_Start | 0.2793 | inf | 0 | nan | nan | nan |
| Околна среда_Start | 0.0565 | inf | 0 | nan | nan | nan |
| Физически, химически и науки за Земята_Start | 0.3694 | inf | 0 | nan | nan | nan |
| Информационно-комуникационни технологии_Start | 0.2217 | inf | 0 | nan | nan | nan |
| Технически науки, производство и строителство, интердисциплинарни програми и квалификации_Start | 0.0076 | inf | 0 | nan | nan | nan |
| Селско стопанство_Start | 0.1097 | inf | 0 | nan | nan | nan |
| Горско стопанство_Start | 0.0293 | inf | 0 | nan | nan | nan |
| Рибно стопанство_Start | 0.0033 | inf | 0 | nan | nan | nan |
| Социални услуги_Start | 0.0250 | inf | 0 | nan | nan | nan |
| Здравеопазване и социални услуги, интердисциплинарни програми и квалификации_Start | 0.0978 | inf | 0 | nan | nan | nan |
| Персонални услуги_Start | 0 | nan | nan | nan | nan | nan |
| Комунално-санитарни услуги, здраве и безопасност при работа_Start | 0.0206 | inf | 0 | nan | nan | nan |
| Сигурност_Start | 0.1163 | inf | 0 | nan | nan | nan |
| Транспортни услуги_Start | 0.0489 | inf | 0 | nan | nan | nan |
| Omnibus: | 1.939 | Durbin-Watson: | 0.407 |
|---|---|---|---|
| Prob(Omnibus): | 0.379 | Jarque-Bera (JB): | 1.590 |
| Skew: | 0.626 | Prob(JB): | 0.451 |
| Kurtosis: | 2.256 | Cond. No. | 740. |